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Abstract 

Data distribution is an important step in implementation of any parallel algorithm. 
The data distribution determines data traffic, utilization of the interconnection net- 
work and affects the overall code efficiency. In recent years a number data distribu- 
tion methods have been developed and used in real programs for improving data 
traffic. We use some of the methods for translating data dependence and affinity 
relations into data distribution directives. We describe an automatic data alignment 
and placement tool (ADAPT) which implements these methods and show it results 
for some CFD codes (NPB and ARC3D). Algorithms for program analysis and 
derivation of data distribution implemented in ADAPT are efficient three pass 
algorithms. Most algorithms have linear complexity with the exception of some 
graph algorithms having complexity 0(n 4 ) in the worst case. 
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1. Introduction 

Well organized data layout improves performance of a parallel pro- 
gram. Data location and access patterns affect the amount of communica- 
tions in the program, effectiveness of the cache, memory channel and 
communication network. Data traffic planning was important for gaining 
performance of vector and MIMD machines; it is still important for ccNU- 
MA machines and will be more critical for machines with deeper memory 
hierarchy. 

Processing dependent data items requires the items to be loaded into 
the same processor at the same time. Data dependence and data layout de- 
termine the data traffic at the execution time. The volume and speed of 
data traffic can be optimized by appropriate alignment and distribution of 
the data. 

In many applications (including Computational Fluid Dynamics 
(CFD) on structured grids) the data dependences are well structured. 
Such dependences can be expressed by structured affinity relations be- 
tween arrays and can be translated into HPF (Fligh Performance Fortran) 
data alignment and data distribution statements. Derived in this way, 
data distributions convert the program into a data parallel form with well 
organized data traffic. 

In this paper we present an Automatic Data Alignment and Place- 
ment Tool (ADAPT) designed and implemented to facilitate conversion of 
CFD codes to HPF. The tool was successfully applied to a number of CFD 
codes, including NAS Parallel Benchmarks (NPB), [4], and ARC3D, [8]. 
ADAPT implements a number of known data distribution techniques (see 
Section 9.), involving memory traffic reduction via data distribution opti- 
mization. We start from the program model and an internal representa- 
tion of the program. Then we analyze the array affinity on the loop level, 
nest level and routine level and show how to translate the affinity relation 
into HPF data mapping directives and discuss interprocedural data distri- 
butions. Then we compare the data mapping directives generated by 


2 



ADAPT with data mappings used in HPF versions of NPB2.3 and 
ARC3D. We conclude the paper with a survey of exiting data distribution 
methods, conclusions, and plans for future work. 

2. Program Model 

ADAPT annotates FORTRAN programs with data traffic optimizing 
data alignment and distribution directives (we will use the word "map- 
ping" to refer to both directives) and with HPF interface blocks. ADAPT 
inserts directives either immediately before loop nests or before the first 
executable statement in a subroutine. These directives do not affect con- 
trol flow of the program on single processor. The directives, however, can 
have side effects on parallel machines if processors are not synchronized 
before and after the execution of the REALIGN and REDISTRIBUTE (see 
[[14]]) directives. We will assume that the program is compiled with an 
HPF compliant compiler before execution and that the processors in- 
volved in the execution are synchronized before and after REALIGN and 
REDISTRIBUTE directives. 

For analysis and transformation purposes, a FORTRAN program is 
represented by control a flow graph [1]. The nodes of the graph represent 
program basic blocks (BB) and arcs represent possible transitions between 
BB. The source program will be parsed into a control flow graph augment- 
ed with parse trees for each statement. ADAPT modifies the control flow 
graph from which the resulting annotated program will be generated. 

We reformulate the definition of dependence between grammar at- 
tributes by saying "data item x depends on data item y if value of x de- 
pends on value of y" (see [1] p. 284,). If the data items are defined by 
variables X and Y respectively, we say that "X depends on Y" . If X and Y 
are arrays, we call the set of pairs of dependent array elements " affinity re- 
lation between X and Y". The affinity relation can be translated into array 
alignment. The appropriate distribution of aligned arrays reduces the 
data movement performed by the program. 
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3. Loop Data Alignment 

In this section we consider translation of data dependence on the loop 
level into data alignment. Loop level alignment is derived from the affinity 
relation between arrays referenced in the loop, and the loop Data Transfer 
Graph (DTG). 

Affinity relation. For a pair of arrays used in the same loop statement, 
we define the affinity relation as a correspondence between array ele- 
ments referred with the same value of the loop index. Let arrays a and b 
are accessed with index functions idxa and idxb respectively. The affin- 
ity relation can be represented as a list of dependent pairs: 
do i = 1 , n 

a ( idxa ( i ) ) =b ( idxb ( i ) ) 
end do 

c Af f (a,b) ={ (idxa(i) ; idxb(i) ) , i=l,...,n} 

If there are multiple arrays in the statement, then for each pair of arrays, 
such relation exists. Similarly, a control dependence results in affinity re- 
lations between the arrays involved in the control statement and all arrays 
in each BB immediately dominated by the statement. 

Data Transfer Graph L Each variable used in the loop is represented as 
a node in DTG. As shown in Figure 1, two nodes are connected by an arc 
if the value of the first variable is used for computation of the second (in 
other words, a data item described by the second variable depends on a 
data item described by the first). We annotate each DTG arc connecting 
two arrays with an affinity relation between the arrays. For array A and 
any of its ancestor B in the DTG an affinity relation between A and B can 
be inferred by applying the affinity chain rule along each directed path 
from B to A. 


1. The term Data Transfer Graph is used to avoid a confusion with Data Flow Graph where nodes 
are program statements, with statements A and B connected by an arc if a variable assigned in A 
is used in B. 
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FIGURE 1. Data Transfer Graph of internal loop of 
vectorized autosorting FFT from FT benchmark. 

The Affinity Chain Rule. Consider two statements in a loop 
do i = 1 , n 

b ( idxbl ( i ) ) =c ( idxc ( i ) ) 

! Af f (b, c) ={ (idxbl (i) ; idxc (i) ) } 
a ( idxa ( i ) ) =b ( idxb2 ( i ) ) 

! Af f ( a , b) = { (idxa(i) ;idxb2 (i) ) } 
end do 

The chain rule facilitates the expression of affinity relation for indirectly 
dependent arrays: 

Af f (a, c) = { ( idxa ( i) ; idxc ( j ) ) , 

j = max{J: J<=i , idxbl (J) ) =idxb2 ( i )} } 

The following example further illustrates how to iterate affinity rela- 
tions: 

do i = 1 , n 

c ( idxcO ( i ) ) =d ( idxd ( i ) ) 

! Af f (c, d) = { (idxcO (i) ; idxd(i) ) } 
b ( idxbl ( i ) ) =c ( idxc ( i ) ) 
a ( idxa ( i ) ) =b ( idxb2 ( i ) ) 
end do 

Af f (a, c) ={ (idxa (i) ; idxc ( j ) ) , 

j = max { J : J<=i , idxbl (J) ) =idxb2 ( i )} } 

Af f (a, d) ={ (idxa (i) ; idxd(k) ) , 

k=max{K : {K<=i, idxcO (K) =idxc ( j ) , 

where j = max{J: J<=i , idxbl (J) ) =idxb2 ( i )} } 

The affinity chain rule helps to compute the closure of affinity rela- 
tions on a loop DTG. If applied to an array assigned in the loop, it will ex- 
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press in affinity relations of an array with each array it depends on. In 
general, due to the :r.ax operation involved in the chain rule, the affinity 
relation is too complex to be expressed explicitly 1 . In practice, the affinity 
relation often can bo expressed explicitly or can be approximated by a 
simple explicit relation. For example, if array indices are linear functions 
of the loop index, then the affinity relation can be represented by a linear 
mapping between array indices. The majority of affinity relations we en- 
countered fall into three classes. 

One-to-one affinity relations. This type of affinity relations can be trans- 
lated into the alignment providing communication-free computations. As 
shown in Figure 2, each element of a depends on a single element of b and 
a single element of c. The HPF alignment directive asserts that these ele- 
ments are aligned. As result, no communications are necessary (regard- 
less of the distribution). For the translation to be possible at most one 
subscript coefficient can be different from {+1,-1} 2 . 

! HPF$ ALIGN A(i) WITH B(i+3) 

! HPF$ ALIGN B(j) WITH C ( 5* (n-j ) -4 ) 

! HPF$ DISTRIBUTE C (BLOCK) 
do i=l,n 

b(n-i) =c (5*i-4) 
a(i-l) =b(i+2) 
end do 

c Af f ( a , c ) = { ( i-1 ; 5 * j -4 ) , j=max{ J : J<=i , n-J=i+2 } } 

c or Af f (a, c ) = { ( i-1 ; 5* j -4 ) , j =max{ J :n-2<=2 *i , J=n-i-2 
c or Af f (a , c ) = { ( i-1 ; 5*n-5*i-14 ) , n-2<=2*i} 


FIGURE 2. Translation of one-to-one affinity relations into an alignment 
providing communication free computations. Note that the alignment 
statement for c is stronger then required by Aff(a,c). 


Stencil Affinity Relations. The most common case we observe in our ap- 
plications is one-to-few affinity relations between arrays (few means a 
fixed number, independent on the array sizes and the number of loop it- 


1. In Section 4. we show that the problem of checking that an element of multidimensional array 
is affine with an element of another array is NP-complete. 

2. The syntax of the ALIGN directive ([14], p. 116) allows to use only align dummies in align 
subscript list and integer expressions of align dummies in align subscript. 
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erations). Explicit difference operators for structured discretization grids, 
for example, give rise to such relations. These relations can be approxi- 
mated by a stencil and we call them stencil relations. To optimize align- 
ment for a stencil relation we note that for block distribution the message 
size per partition point is the sum of distances of the alignment point from 
the other stencil points, see Figure 3. To minimize the message size we use 
bisectors (points splitting stencil points into two sets of equal size) as align- 
ment points. So, for stencil affinity relations, we generate alignment as 
bisectors of the affinity points in the alignee. A cyclic distribution for sten- 
cil affinity relations would create an order of magnitude more communi- 
cations than block distribution and is not considered as a valid option. 


partition point 


ID grid 



^ Alignment point, sum of distances = 3 

- 4 — 4 — 4 — 


1 word to communicate 
0 I \ 2 words to communicate 


0 words to communicate 


3 words to communicate per partition point 

FIGURE 3. The message size per partition point is sum of distances of the 
alignment point from the other stencil points. 


True dependence. If there is a directed path in DTG which starts and 
ends in the same node, then there is a dependence carried by the loop. 
This path allows us to iterate the chain rule. As a result, each array ele- 
ment depends on multiple elements of the same array. There are two im- 
portant cases: the dependence is "true" when previously computed array 
elements are used and "anti" when a used array element is overwritten, 
[12]. The anti dependence can be removed from the loop by making an ex- 
tra copy of the array. If the true dependence has a constant step d then a 
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cyclic distribution onto d processors would create communication free 
computations. 


4. Loop Nest Data Alignment and Distribution 

If arrays referenced in a loop nest have more than one dimension (3 
and 4 dimensional arrays are most common for structured grid CFD) then 
we must consider all loops surrounding computations with the arrays to 
obtain complete data affinity. In CFD applications, loop index bounds are 
usually linear functions of the surrounding loop indices, hence the nest in- 
dex domain can be described as a set of integer points in a polyhedron 1 . 

The chain rule for loop nests is similar to one for loops: 

do 10 I from PI 

b ( idxbl ( I ) ) =c (idxc (I) ) 
a ( idxa ( I ) ) =b ( idxb2 ( I) ) 

10 continue 

Aff (a, c) ={ (idxa (I) ;idxc (J) ) , 

J=max{ j : j <=I , idxbl ( j ) — idxb2 ( I ) } } ( ) 

where the max operation and inequality {j<=U are performed in the lexi- 
cographical order imposed by the nest indices. For statements with differ- 
ent nesting the chain rule is similar: 

do 10 I from PI 
do 20 J from PJ 

b( idxbl (I, J) ) =c ( idxc ( I , J) ) 

20 continue 

do 30 K from PK 

a ( idxa ( I , K) ) =b ( idxb2 ( I , K) ) 


1 We will use multidimensional indices, functions and domains in this section. It will make 
presentation more compact and the analogy between loops and nests more transparent. For exam- 
pie, instead of 

do i=l , nx 

do j=l,ny(i) 

do k=l ,nz < i / j ) . , 

a(idxal(i, j,k) # idxa2(i, j,k) , idxa3 U,D ,<> )- 

b( idxbl (i, j f k) , idxb2 (i, j /k) , idxb3 ( 1 , D »k) ) 

end do 
end do 
end do 


we will write 

do 10 I from PI 

a ( idxa ( I ) ) =b ( idxb ( I ) ) 
10 continue 
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30 continue 

10 continue 

Aff (a,c)={ ( i dxa ( I , K ) ;idxc(L,J) ) , 

(L, J) =max{ (1 , j ) : 1<=I, idxbl (1, j ) =idxb2 (I, K) } } 

Example. The problem of checking that an element of array is affine with an 
element of another array is NP-complete. We show that the Boolean Knapsack 
Problem is a special case of calculating affinity for an array element in the 
above shown relation (*). Let nest loop bounds are 0 and 1, meaning that 
PI is a boolean cube. Let I=(l,...,l) which means that the inequality (j<=V 
is true for all j from PI. In this case if idxbl(j) is an arbitrary linear form of 
j then calculating the value of J in (*) is a special case of the Boolean Knap- 
sack Problem. In practice, however, the number of array dimensions is 
fixed (it does not exceed 7 for CFD codes) and index functions are linear 
functions. In this case the affinity relation can be calculated in polynomial 
time by Lenstra's algorithm, see [22]. 

The chain rule allows the construction of an affinity relation for each 
directed path coming to a from b and passing only through privatizable 
arrays. The union of these relations over all directed paths to a from b 
yields the final affinity relation between a and b. The relation lists all ele- 
ments of b used for computation of the element of a and can be considered 
as one-to-many mapping. 

In most practical cases, this is a one-to-few stencil relation; a bisector 
of the stencil gives an optimal alignment. Many CFD stencils have a center 
of symmetry which can be used as the alignment point, see Figure 4. 


1. Bisector of a finite set in n-dimensional space is a point such that each coordinate hyperplane 
passing through it bisects the set. 3-point LU stencil shows that not every set has a bisector. 
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! HPF$ ALIGN C(i,*,*) WITH A(i-1,*,*) 

! HPF$ DISTRIBUTE (BLOCK, *,* ) :: A 

do k=2 , nz 
do j = 1 , ny 
do i = 1 , nx 

A(i-1, j ,k) =0.2* (C(i, j ,k) +C(i, j-l,k)+ 

> C(i, j+1, k) +C(i, j , k-1) +C (i, j ,k+l) ) 

end do 
end do 
end do 

C Af f ( A , C ) = { (i-l,j,k;i,j,k) 

C Sc(i-1, j ,k;i, j-l,k)&(i-l, j,k;i, j + 1 # k ) 

C &(i-l,j / k;i,j,k-l)&(i-l,j,k;i,j,k+l) } 

C Stencil Affinity (1,0,0), <1,-1,0), (1,1,0), 

c (1,0,-1), (1,0,1) 

C with bisector (1,0,0) 

FIGURE 4. Optimizing communications by choosing stencil 
bisector as the alignment point. 

Alignment with Systems of Linear Forms. Each array reference in the 
loop nest defines a mapping of the nest domain into array index space. 

Here we consider linear mappings: 

d = Al + b 

where d and I are array and the nest indices respectively, A is a matrix with 
constant elements and & is a vector with constant elements. 

For linear index functions, the affinity relation can be expressed in the 
form: 

Aff (x lf x 2 )={ (A 1 *I+b 1 ;A 2 *I+b 2 ) , I from PI} 

We want to translate this relation into alignment directives of the form 

! HPF$ ALIGN Y1 (il, ..., in) WITH 

Y2 (ml*jl+cl, . . . ,mn* jn+cn) 

where mp...,m n are integer multipliers, cp...,c n are integer shifts, (ip...,i n )-> 
( jv—'jn ) a dimension permutation, Yj and are Xp x 2 or auxiliary tem- 
plates. 

In a common case, when one matrix (say Al) is nonsingular, the rela- 
tion may be written explicitly: 
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Aff (x 1 ,x 2 ) ={ (d 1 ;A 2 A 1 ' 1 d 1 -A 2 A 1 " 1 b 1 +b 2 ) ,d 1 from A^I+b^ (1) 

If the matrix A 2 Af^ can be transformed to an integer diagonal matrix di- 
ag (mj r ...,m n ) by a permutation of the columns, then -A 2 A^ *bi+b 2 = 
(cy...,c n ) is an integer vector and the relation can be translated to an align- 
ment directive: 

! HPF$ ALIGN xl ( il in) WITH 

x2 (ml* j 1+cl , . . . , mn* jn+cn) !{2) 

where is the column permutation. 

If the matrix A 1 A{ 1 cannot be reduced to a diagonal form by permu- 
tation of columns, then (1) would require presence of general linear forms 
in the align subscript list which is not permitted in HPF, see [14], p. 116. 
The relation (1) cannot be expressed by HPF ALIGN directive also if both 
A 2 Ai and AjAf 2 have noninteger elements. In such a case we can look 
for a submatrix of A : and A 2 having the property. If such submatrix exists, 
the alignment is performed on the corresponding set of indices. 

The generation of alignment directives uses the alignment graph derived 
from DTG. The nodes of the alignment graph are non privatizable arrays 
of DTG. An arc connects two nodes having a directed path passing 
through privatizable arrays only. We annotate each arc of the alignment 
graph with a list of closures of affinity relations along each simple path 
connecting the arrays in DTG 1 . A closure of an affinity relation along a 
path is a result of successive application of the chain rule along arcs of the 
path. 

For each arc in the alignment graph, we analyze the affinity relations 
attached to it. If all relations are expressed in the form (1) and have the 
same dimension permutation, then we will generate directive (2). In the 
directive, each multiplier is the greatest common divisor of multipliers of 
the relations, the shift is a bisector of the relations shifts and (i 1 ,...,i rt )-> 

1. This is the most expensive operation of the method. It involves a few matrix multi- 
plications for computation of the set of directed paths and has complexity 0(n ), where 
n is the number DTG nodes. 
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(jp...,j n ) is the common dimension permutation. 

If there are directed paths connecting A and B in both directions, then 
there is self affinity for A (and B). The self affinity relation cannot be trans- 
lated into alignment; it has to be translated into an array distribution. If 
the dimension mapping is not identity, then the self affinity relation is too 
complex to be handled and the array is flagged as nondistributed. 

The self affinity relation can be iterated by applying the chain rule 
along a circular path A->B->A multiple times. This creates affinity of each 
element of A with a number elements of A. In general, the iterated affinity 
is too complex and the array is also flagged as nondistributed. In the ma- 
jority of practical cases, the affinity is simple and can be iterated explicitly 
and translated into a distribution minimizing communications. Assume 
that the self affinity relation can be expressed with linear forms: 

Aff (A,A)={ (I;C*I-z) , I from PI} 

Application of the chain rule gives 

Aff (A, A) = { ( I ; C* J- z ) , 

J=max{j: j <=I , j =C*I-z } , I, J from PI} 

It shows that the affinity relation is void if C*I-z>I. In this case, the relation 
is an anti-dependence and can be eliminated by making an extra copy of 
A. If C*I-z<I and C is a nonsingular matrix (this property is necessary for 
dropping the max operation in the relation) then the relation can be fur- 
ther simplified: 

Aff (A, A) ={ (I;C* (C*I-z) -z) , I,C*I-Z from PI} 

and for k iterations we have 

Aff (A,A) = { ( I ; J ) , J=C k *I- (C k_1 +. . -+l)*z) , 

I , C _1 * J+z from PI} 

In most practical cases, C is the unit matrix and the relation can be further 
simplified: 

Aff (A,A)={ (I;I-k*z) , I , I- (k-1) *z from PI} 

If there are a number of circular paths which start (and end) with A, 
the relations along the paths can be combined: 
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( 3 ) 


Af f (A, A) = { ( I ; I-v) , v=k 1 *z 1 +. . .+k q *z q , 

~L,v-Zi from PI} 

where Zp.-^Zj are so called dependence vectors, and l<=l<=q. This type of 
dependence is considered in [2]. 

Let the affinity relation along one path from B to A be 
Af f (B, A) = { (I; idxa (I) ) , I from IP} 
and idxa(I)<=I. Application of the chain rule to (3) shows that the same 
vectors are dependence vectors for B. 

Af f (B , A) = { (I; J-v) , v=k 1 *z 1 +. . .+k q *z q/ 
v-z 1 and I from PI, 

J=max{j: j=idxa ( I) , j<=I} }= 

= { (I; idx(I) -v) , v=k 1 *z 1 +. . .+k q *z q , 
v-z^ and I from PI} 

Suppose there is a dimension orthogonal to the dependence vectors, 
meaning that all components of the vectors in the dimension are zero. Dis- 
tribution of the orthogonal dimension results in assigning of all affine ar- 
ray elements to the same processor. Only inter-array communications will 
be necessary for the computations. These communications can be opti- 
mized with appropriate alignment as discussed above. 

If an orthogonal dimension does not exist, no HPF distribution would 
put all affine elements of A onto the same processor and every HPF distri- 
bution of A would have dependences between distributed sections of A. 
In this case, the generation of HPF directives should take into account the 
compiler's ability to pipeline the computations. If the compiler is able to 
perform pipelined processing of the sections with dependences, then the 
distributed dimension should be chosen to minimizes section dependenc- 
es (the dimension with the least number of dependence vectors having 
nonzero components). 

If the HPF compiler is not able to pipeline computations, the loop nest 
should be reordered to perform computations with independent subset of 
elements. Indices of the subsets can be identified by equation T*I = const, 
where T is a time vector leaving all dependences in the past: 


\ 
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T*z j<0,...,T*z q <0. 

In the case when an orthogonal vector exists, we generate a template 
and align both A and B with it. A maximal equivalence class of graph 
nodes having directed paths in both directions is called strongly connect- 
ed component" of the graph. The set of strongly connected components 
forms a directed acyclic graph, and we attach a template to each node of 
the graph. The affinity relation for loop nest then expressed as alignment 
of each array of the strongly connected component with the template, as 
shown in Figure 5. 

Generation of alignment statements for each connected component of 
a directed graph with affinity relations attached to every arc is performed 
in three steps. 

• 1. A common template is generated for all leafs of the compo- 
nent. Each leaf is connected to the template with arc and appro- 
priate affinity relation attached to the arcs. 

• 2. A rooted spanning tree is constructed for each component with 
the template as the root. 

• 3.For each non root node of the spanning tree, the alignment 
directive is generated for the arc leading from the node to the 
root (darker arcs in the Figure 6) 
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! HPF$ TEMPLATE tmpl_nes t_4 1 ( 64 , 64 ) 

!HPF$ DISTRIBUTE (BLOCK, BLOCK) :: tmpl_nes t_4 1 

! HPFS ALIGN FR ) WITH tmpl_nest_41 ( : , : ) 

!HPF$ ALIGN WITH FR ) : : ZX , ZY , ZZ , XX , XY , XZ , YX , YY , YZ 

! HPFS ALIGN Q( WITH FR ) 

DO 3 2 K=KLOW, KUP , 1 

K?1=KPLU5(K) !K+1 

KM1=KMINUS (K) ! K-l 

DO 32 J=2 , JM, 1 

3Z1=ZX( J, K, L) **2+ZY( J, K, L) **2+ZZ ( J, K, L) **2 

RHO=Q (J,K,L,1)*Q(J,K,L,6) 

U=XT+ ( XX ( J , K , L ) * Q ( J , K , L , 2 ) +XY ( J , K , L ) *Q { J , K , L , 3 ) +XZ ( J, K, L) *( 
+ ,K,L, 4) > 'Q ( J, K, L, 1) 

V=YT+ ( YX { J , K , L ) *Q ( J , K , L , 2 ) +YY ( J, K, L) *Q ( J, K, L, 3 ) +YZ ( J , K, L) *( 
+ ,K,L,4) ) /Q(J,K,L, 1) 

Sl=-RHO~ZX ( J, K, L) * (U* (Q{ J+l, K, L, 2 ) /Q { J+l , K, L, 1) -Q ( J-l , K, L, ! 
+ Q(J-1,K,L,1))*0. 5+V* (Q ( J, KP1 , L, 2) /Q ( J , KP1 , L , 1 ) -Q ( J, KM1 ,L , 2 

+ ( J , KMl , L , 1) ) *0 .5) 

52 = -RHO* ZY ( J, K, L ) *(U*(Q(J+l,K,L,3)/Q(J+l,K,L,l)-Q(J-l,K r L,j 

+ Q (J-l , K, L, 1 ) ) *0 . 5+V* ( Q ( J , KPl , L , 3 ) /Q { J , KPl , L , 1 ) -Q ( J , KMl , L , 3 
+ (J,KM1,L, 1) ) *0.5) 

53 = -RHO w ZZ (J,K,L) * {U* (Q ( J+l , K, L, 4) /Q ( J+l , K, L, 1) -Q(J-1,K,L , < 

+ Q(J-1,K,L,1) ) *0. 5+V* (Q{J,KP1,L,4) /Q(J,KP1,L, 1) -Q{J,KM1,L, 4] 

+ ( J, KMl , L, 1 ) ) *0 . 5 ) 

R1=S1+S2+S3 

FR (J,K,L) = (- 2. *R1 /BZl + 4 . *FR ( J, K, LI ) -FR ( J, K , L2 ) ) / 3 . 

32 CONTINUE 

CONTINUE 


FIGURE 5. Example of the generated directives for one of the nests of 
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FIGURE 6. The data transfer graph of the nest of the 
Figure 5. Dark arcs show a spanning tree. 
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5. Subroutine Level Data Distribution 

For data distribution on the subroutine level, we build a phase control 
flow graph (PCFG) [15]. The graph nodes are loop nests having at least one 
nonprivatizable array (following [15] we skip the loops (usually iteration 
loops) with the index not used as an array index). Two loop nests are con- 
nected by an arc if there is a possible transition from the last BB of one loop 
nest to the first BB of another loop nest. Given an alignment graph and the 
ALIGN and DISTRIBUTE directives (mapping directives) for each nest of 
PCFG; is it possible to combine the directives for a pair of adjacent nests 
in PCFG? 

In simple cases (if alignment is the same in both nests) the answer can 
be obtained by comparing distributed dimensions of arrays in each nest. 
This approach, however, does not always work. In general, the answer 
has to be obtained by attaching the second alignment graph to the first 
graph and getting a mapping for the combined graph (in the same way as 
we did it for loop nest graphs). If merging the mapping directives does not 
decrease the number of distributed dimensions, the scope of the mapping 
can be expanded to include both nests. Otherwise, the second nest should 
be combined with another adjacent nest. If the common number of distrib- 
uted dimensions is 0, then either a redistribution between the nests or 
pipeline the computations with data having dependence between sec- 
tions. We use a simple trade-off model between redistribution and pipe- 
lining to choose between these alternatives. The redistribution of an array 
of size N on p processors requires communication of N-N/p elements. The 
pipelined computation requires to communicate f*N*p/d elements, where 
d is the array size in the pipelined direction and /is the number of the de- 
pendence vectors. The cost of pipeline startup is a factor of 1+pd/N of the 
execution time. A more precise performance model that takes into account 
the pipeline blocking factor, the startup and latency of the communica- 
tions and overlapping of the communications with the pipelined compu- 
tations and will be included in the next release of the tool. A similar model 


16 



is used to trade-off redistribution and serial execution of a nest. 

Example. Some important CFD applications use ADI (alternative di- 
rection implicit) solvers [20]. ADI solves in x-, y- and 2- directions succes- 
sively. Solver in a particular direction has dependences in one direction 
and prohibits distribution in this direction (if pipelining is not available). 
This means that there exists no single HPF distribution for an ADI solver. 
Solvers in x- and y- directions have 2 as common distribution dimension 
and solvers in y- and 2- direction have x as common distribution dimen- 
sion. In an ADI solver, the redistribution can be done either between x- 
and y- solvers or between y- and 2- solvers. An estimation of work neces- 
sary for redistribution affects the choice of appropriate redistribution 
point. Based on the analysis of the communication required for redistribu- 
tion, x- and y- solvers were chosen to work with data distributed in the 2- 
direction in ARC3D. 

The propagation of the mapping directives along arcs of PCFG results 
in annotation of each source and sink of the graph with the directives. 
These annotations, together with information on in/ out/ through subrou- 
tine arguments and on the subroutine common blocks are used for deriv- 
ing of the final redistributions and subroutine interfaces. The HPF 
standard requires a subroutine to preserve the distribution of arrays visi- 
ble to other subroutines. To comply with this requirement we perform ad- 
justment of the annotations: 

• For each array argument, we choose the mapping of the array at 
one of the source node of PCFG; include this mapping (as pre- 
scriptive mapping) at the subroutine interface; and remove the 
mapping directives from the source node. 

• For each leaf node of PCFG, we compare the final mapping of 
each subroutine argument with the mapping on the leaf and 
restore the mapping if necessary. 

• For each array on a common block used in other subroutines, we 
first inquire distribution of the array with HPF mapping inquiry 
functions. On each leaf node of the PCFG we restore the map- 
ping if necessary. 
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6. Interprocedural Data Distribution 

In the considerations above, we assumed that statements do not in- 
clude subroutine or function calls. To remove this assumption, we can ei- 
ther inline the subroutine or use HPF subroutine interface to express the 
data mapping in the subroutine. The inlining requires the same analysis 
at each call statement and may not result in any useful distribution if there 
is no single data distribution inside the routine. The use of HPF subrou- 
tine interface limits the expression of the data mapping through the rou- 
tine interface. The data mappings in subroutine must comply with the 
HPF requirement of preserving data distributions by a callee. 

In our tool, we have used the interprocedural data distribution meth- 
ods developed at [9], [13], [15] and [19]. At each call site, the interproce- 
dural analysis provides the mapping of dummy arguments onto actual 
arguments. This mapping is used to attach the subroutine alignment 
graph to the loop nest alignment graph and transform the mapping de- 
clared at subroutine interface into the nest mapping. If the mappings are 
compatible, meaning that the number of distributed dimensions for com- 
bined graphs is larger than 0, then the scope of both mappings can be com- 
bined, otherwise data redistribution at each call site will be necessary. 

In many CFD programs, such as ARC3D, argument redimensioning 
is a common construct. According to the HPF standard, in this case (see 
[14], p. 162), both actual and dummy arguments must be declared as se- 
quential. Passing array by reference to its first element is also not permit- 
ted in HPF. The appropriate array section is required instead. To comply 
with this HPF requirement, we had to rewrite by hand some subroutine 
calls with passing an appropriate array section instead of an array ele- 
ment. 

7. Automatic Data Alignment and Placement Tool (ADAPT) 

The data alignment and distribution techniques described in previous 
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sections have been implemented in ADAPT. The tool is written in C++ 
and is based on a few standard C++ classes such as List, Point, Vector and 
Matrix. Some advanced classes such as Polynomial (symbolic polynomi- 
al), SLForm (system of linear forms) and DGraph (NAS Directed Graph 
class) have been implemented and widely used in the tool. The burden of 
FORTRAN program parsing, analysis and code generation is placed on 
CAPTools [11]. 

CAPTools (Computer Aided Parallelisation Tools) have been devel- 
oped in University of Greenwich, UK [11]. CAPTools demonstrated the 
ability to parse, analyze and parallelize a number of CFD applications, in- 
cluding NPB and ARC3D. As a result of an agreement between the Uni- 
versity of Greenwich and NASA Ames Research Center, the CAPTools 
group provided Parallel Tools group at NAS with an API. This includes a 
description of internal data structures used by CAPTools, internal pro- 
gram representation (application data base), a number of utilities and a 
code generator. 

ADAPT uses the CAPTools generated database to perform a single 
pass through the source program. It builds a PCFG for the whole applica- 
tion and a data transfer graph (DTG) for each loop nest. It annotates each 
arc with the affinity relation between arrays representing the arc ends. 
The complexity of processing a nest with n arrays is 0(n 4 ) and is dominat- 
ed by computing the closure of affinity relations. As a result, data align- 
ment directives are generated for each nest. The directives are then lifted 
bottom up along the arcs of PCFG by creating subroutine interfaces and 
either merging directives or placing redistribution directives. 

8. Experiments with CFD Codes 

We have performed some initial evaluation of the data distributions 
generated by ADAPT. The tool was run on NPB working on single struc- 
tured grid: BT, SP, LU and FT and on an aerodynamic application ARC3D. 
A qualitative comparison with the data distributions used in handwritten 
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HPF implementations of NPB [7] and ARC3D [8] is given in Table 1. All 
applications except LU use redistribution of data. The redistributions and 
their locations in the code have been successfully determined by ADAPT 
(linel). In all applications except SP some distributed arrays are passed as 
subroutine arguments. ADAPT was able to use the interprocedural infor- 
mation generated by CAPTools to move the distributions across subrou- 
tine boundaries and generate HPF subroutine interfaces (lines 2 and 3). 
For simple dependences between distributed array sections (BT,SP,LU 
and ARC3D) ADAPT was able to ignore the redistributions and leave the 
pipelining to the compiler, line 4. ADAPT was not able to generate redis- 
tributions of some boundary data necessary for efficient computations of 
boundary conditions (BC) in ARC3D, line 5 (note that BC was excluded 
from the plot on Figure 7). Based on the analysis of index expressions and 
loop nest indices ADAPT was able to detect and skip iteration loop, line 
6, as well to perform qualification of privatizable arrays, line 7. Non of the 
considered applications benefit from a cyclic distribution and ADAPT 

was not set to generate it, lines 8 and 9. 

TABLE 1 . ADAPT (A) versus Manual (M) HPF Data Distribution for 
Scientific Codes. [+ uses the feature, - does not use the feature, * 
depends on compiler support, / automaticly generated] 


Benchmark 

BT 

SP 


FT 

ARC3D 

DD Features 

□ 

D 

M 

D 

M 

D 

D 

D 

D 

D 

1. Redistribution 

+ 


D 

1 

- 

- 

4 - 

/ 

4 - 

/ 

2. Interprocedural 

+ 

/ 

- 

- 

+ 

/ 

4 - 


4 - 

/ 

3. Interfaces 

+ 

H 

■ 

- 

+ 

/ 

4 - 

/ 

+ 

/ 

4. Pipeline 3 

■ 

H 

■ 

/ 

* 

/ 

- 

- 

* 

/ 

5. BC redistribution 

- 

■ 


- 

- 

- 

- 

- 

+ 

- 

6. Time loop invariant 

+ 



/ 

□ 


H 

/ 

4 - 

/ 

7. Privatization (new) 

+ 

H 




/ 


/ 

4 - 

/ 

8. Block distribution 

+ 


H 

/ 

4- 

✓ 

+ 

/ 

4 - 

/ 

9. Cyclic distribution 


- 

- 

- 

- 


_ 

- 

- 

* 


a. The feature can be used if the compiler is able to support pipelining 
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The worst case complexity of 0(n 4 ) for computing the closure of the 
affinity relation (where n is the maximum number of nodes in the nest 
DTG) was never reached. Table 2. The execution time (line 6) was domi- 
nated by other factors such as computing of the affinity relations from in- 
dex expressions and the lifting of the directives along edges of PCFG. The 
complexity of these operations is proportional to the number of arcs in 
DTG (line 3) and in PCFG (line 4) respectively. Overall ADAPT execution 
time was significantly less than the CAPTools (line 5) analysis time. 

TABLE 2. ADAPT Performance 


Benchmark 

BT 

SP 

LU 

FT 

ARC3D 

1. Number of subroutines 

48 

33 

34 

31 

33 

2. Number of nests 

165 

51 

43 

17 

82 

3. Max size of DTG (nodes,arcs) 

(30,381) 

(29,148) 

(39,480) 

(12,16) 

(48,201) 

4. Size of PCFG (nodes,arcs) 

(165,220) 

(173,229) 

(174,208) 

(85,122) 

(253,297) 

5. CAPTools analysis time (min.) 

72 

67 

26 

30 

23 

6. ADAPT CPU time (sec.) a 

3 

3 

14 

1 

6 


a. The execution time is on 150 MH SGI R5000 machine, including time for code generation and excluding 


time for creating CAPTools data base. 

Finally we have applied some hand editing to the code generated by 
ADAPT for FT 1 and ARC3D. The editing was necessary for replacing in 
subroutine arguments array element addresses by array sections and for 
implementing the REDISTRIBUTION statement since the pghpf 2 . 4 does 
not support it. The implementing redistribution included coping distrib- 
uted arrays to arrays with an alternative distribution and substituting the 
arrays with alternative distribution instead of original arrays in the scope 
of the REDISTRIBUTION directive. The resulted code was compiled with 
the pghpf 2 . 4 compiler from Portland Group Inc. The performance of the 
code was comparable with the performance of the handwritten HPF code 
and with the MPI code. Figure 7. 


1. An inspection of the DTG and PCFG of the FT suggested that one of three 3D complex arrays is 
redundant. A removing of this array from the benchmark reduced the memory requirements by 
30% and slightly improved performance. 
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FIGURE 7. Comparison of MPI version (dotted curve), handwritten HPF version 
(solid curve) and AUAPT generated (dashed curve) versions of FT (left) and ARC3D 
(right). The boundary condition subroutine excluded from ARC3D plot since it requires 

a number of hand tuning steps. 


9. Automatic Data Distribution Techniques 

A number of methods for automatic data mapping have been de- 
signed: [2],[5],[13],[15],[16] / [17] j ,[18] / [23]. Some of the techniques were de- 
veloped in a framework of an automatically parallelizing compiler, others 
in the context of parallelizing tools. An extensive survey of data layout 
methods is given in [15] and a case study of 4 approaches is given in [3]. 
General requirements to data distribution tools are listed in [21]. We will 
concentrate on techniques suitable for distribution of data defined on a 
single or multiple structured grids. 

Most of the existing tools (with CAPTools an exception) were imple- 
mented as a "demonstration of concept" and none of them have demon- 
strated ability to analyze medium or large size codes and generate an 
efficient HPF program. 

An approach to data layout based on a decomposition of procedures 
into phases and finding the best static alignment for each phase was de- 
veloped by Li and Chen in [16]. The algorithm performs inter-dimension- 
al alignment as a first step and intra-dimensional alignment as the second 
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step. The inter-dimensional alignment is formulated as a partition prob- 
lem for the Component Affinity Graph (CAG). The authors propose a heu- 
ristic algorithm to find the best alignment, in general, however, they show 
that the problem is NP-complete. 

Paradigm [18]. The approach is based on the analysis of the communi- 
cation graph generated by Parafrase-2. The graph nodes are program 
statements and graph edges are data flows between statements weighted 
with cost of the communications. The graph is recursively decomposed 
into a hierarchy of phases by removing a maximal cut on each step. The 
decomposition of the communication graph stops at the point when a stat- 
ic distribution can not be improved by further decomposition. Then a 
phase transition graph is built with the edges weighted by the cost of re- 
distributions. A critical path in the graph gives the best sequence of phases 
and phase transitions. The tool have been successfully applied to 2D FFT 
and ADI kernels. 

SUIF [2],[17],[23]. An algorithm for dynamic data decomposition is 
given in [2]. It is applicable to an arbitrary sequence of loop nests with 
loop boundaries and array references described by linear functions. It in- 
volves 3 main steps: 1. finding communication free decomposition, 2. if 
such a decomposition can not be found the algorithm searches for a de- 
composition with pipelined communications, 3. if such partition can not 
be found the algorithm applies a heuristic to group the nests to find a par- 
tition with pipelined communications within each group and redistribut- 
ing data between the groups. The algorithm was enhanced in [17] to find 
partitions minimizing synchronizations. The SUIF is a C compiler, re- 
quires use of f 2c to generate intermediate C code from FORTRAN code 
and is not able to generate HPF code. 

dHPF [13], [15]. The approach consists of reduction of the data distri- 
bution problem to a Boolean optimization problem and applying of a 
commercial package (CPLEX) for solving it. The reduction proceeds in a 
number of steps. On the first step the program is partitioned into phases. 
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Then for each phase a CAG is built. The partitions of CAG are candidate 
data layouts. The optimal layout is a critical path in the data layout graph 
which nodes are the candidate layouts and edges are possible remappings 
of the layouts between the phases. The edges are weighted with an empir- 
ical estimation of the remapping cost. The resulting optimization problem 
is then formulated as 0-1 programming problem and solved with aid of 
CPLEX. The tool was able to generate alignment and distribution state- 
ments for ADI kernel and Erlebacher and Tomcatv benchmarks. 

CAPTools [6], [11], [12]. CAPTools has an ability to apply block, cyclic 
and block/ cyclic distributions to data defined on structured [6] and on 
unstructured [11] grids. The distribution requires the user to specify an ar- 
ray and a dimension to be distributed. As soon a distribution have been 
defined a MPI code implementing "owner computes" rule is generated. 

10. Conclusions and Future Work 

We described methods for translating data dependence and affinity 
relations into HPF data mapping directives. These methods have been 
used for generating data distributions for HPF versions of NPB [7] and 
ARC3D [8]. We believe that we found a fine line between useful generality 
and intractable complexity in analyzing and treating data affinity. Our al- 
gorithms for program analysis and derivation of data distributions are ef- 
ficient three-pass algorithms. The majority of algorithms have linear 
complexity with exception of some graph algorithms having complexity 
0(n 4 ) ( n is the number of variables used in the program nests) in the worst 
case. 

We implemented the methods in an Automatic Data Alignment and 
Placement Tool (ADAPT). Initial comparison shows that the data map- 
pings generated by ADAPT are very close to the data mapping directives 
used in hand written HPF version of BT, LU and ARC3D. We aim ADAPT 
at real CFD applications such as OVERFLOW [10]. 
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